Broad misappropriation of developmental splicing profile by cancer in multiple organs

Oncogenesis mimics key aspects of embryonic development. However, the underlying mechanisms are incompletely understood. Here, we demonstrate that the splicing events specifically active during human organogenesis, are broadly reactivated in the organ-specific tumor. Such events are associated with key oncogenic processes and predict proliferation rates in cancer cell lines as well as patient survival. Such events preferentially target nitrosylation and transmembrane-region domains, whose coordinated splicing in multiple genes respectively affect intracellular transport and N-linked glycosylation. We infer critical splicing factors potentially regulating embryonic splicing events and show that such factors are potential oncogenic drivers and are upregulated specifically in malignant cells. Multiple complementary analyses point to MYC and FOXM1 as potential transcriptional regulators of critical splicing factors in brain and liver. Our study provides a comprehensive demonstration of a splicing-mediated link between development and cancer, and suggest anti-cancer targets including splicing events, and their upstream splicing and transcriptional regulators.

Fisher's test for the enrichment of EP and EN events among the cancer-specific events after removing all the EP events which had near zero inclusion in normal GTEx samples (i.e., PSI < 0.05 in > 80% of the samples) Shown are the FDR adjusted two-sided p-values from Fisher's exact test. f Scatter plot for the "pre-natal -post natal" ∆PSI of splicing events (x-axis) against their "TCGA -GTEx" ∆PSI in cancer. Blue lines and shaded grey areas respectively depict the best fitting lines and 95% confidence intervals based on linear regression, Pearson's correlation coefficients and two-sided p-values are shown in the plots. g Bar plots showing the odds ratio of enrichment/depletion for the EP and EN events among the cancer-specific events. The cancer-specific events were identified using a Wilcoxon's test to assess the differential inclusion (|∆PSI| > 0.2 & FDR < 0.1) of each exon in TCGA relative to GTEx. The FDR adjusted two-sided p-values of enrichment (odds ratio) are shown. At a stringent FDR threshold of odds ratio (FDR <= 0.01), a total of 10 out of 12 comparisons show the expected trend, i.e., the enrichment of EP/EN events among events increased/decreased in cancers and depletion of EP/EN events among the events decreased/increased in cancers. At relaxed FDR threshold of 0.25, 11/12 comparisons were significant. h, i Odds ratio for the enrichment of EP and EN events among frequently increased and decreased events in the early and late-stage cancers, showing the greater reversion to embryonic splicing in advanced stage cancers. In G-I, Odds ratio was calculated using the Fisher's exact test and numbers near each bar are the FDR adjusted two-sided p-values of the Fisher's test. J Hazard ratio for EP and EN exons separated into three subclasses; namely negative (n = 1309 for Brain EN and 853 for Brain EP, n = 480 for Kidney EN and n = 401 for Kidney EP), positive (n = 917 for Brain EN and n = 1098 for Brain EP, n = 331 for Kidney EN and n = 439 for Kidney EP), and uncorrelated (n = 1230 for Brain EN and n = 1098 for Brain EP, n = 529 for Kidney EN and n = 576 for Kidney EP) depending on the correlation of their inclusion level with the expression level of host gene, showing that hazard ratios of EP and EN exons is independent of the expression of their host gene. In boxplots, the horizonal line in the middle is the median value and the lower and upper edges of the boxes correspond to the 25th and 75th percentiles. Extending vertically upwards/downwards of the boxes are the lines showing 1.5 times the interquartile range (i.e., distance between 25 th and 75 th percentile). Dots are the outliers. Source data for these figures are provided as a Source Data file.

Supplementary Figure 3
Supplementary Figure 3 a Heatmap for the GO term (molecular function) enrichment analysis of host genes of EP and EN events in three tissues. The colors in heat cell correspond to -log10 of FDR corrected one-sided p-value of enrichment from Fisher's test as implemented in clusterProfiler library in R. b Heatmap for the overlap among the gene sets enriched for WD40, nitrosylation and transmembrane-region domains across EP and EN events in three tissues. For each domain (i.e., the three plots), the numbers along the diagonal indicate the number of genes having that specific domain and off-diagonal entries show the number of common host genes of EP/EN events across tissues. This plot emphasizes that for a given domain, the observed enrichment of these three domains (Fig. 3a) is driven by different set of genes. The colors in the heatmap codes the value written in each cell. c Correspondence between molecular functions and biological processes for transmembrane domains in liver EP events. Source data for these figures are provided as a Source Data file.     non-significant (FDR > 0.05, not-critical group in the plot, (n = 241 for brain, n = 387 for kidney, and n = 200 for liver), and significantly (FDR < 0.05) positive (Positive group in the plot, n = 119 for brain, n = 45 for kidney, and n = 167 for liver), negative (Negative group in the plot, n = 81 for brain, n = 9 for kidney, and n = 74 for liver) regression coefficients. d Phenotypic consequences of deletion/knockdown of critical (CSF) and non-critical (nCSFs) splicing factor orthologs in mice. Bar plots show the fraction of CSFs (red) and nCSFs (blue) that results in developmental phenotypic defects (shown next to bar plots) in mice. The deletion of brain CSFs results in abnormal nervous development. CSFs from all three tissues are much more likely to result in pre-weaning lethality as compared to CSFs. Phenotypes shown here had an enrichment (OR > 1) of CSFs as compared to nCSFs at a two-sided p-value threshold of 0.1 and FDR of 0.30 from Fisher's exact test. For this analysis, we complied the list of phenotypes associated with genetic deletion screens in mice from mouse genome informatics database (http://www.informatics.jax.org/). We manually curated the phenotypes which resulted in lethality during pre-natal stage or in developmental abnormalities in brain, kidney, and liver. The complete set of phenotypes and their associated genes is given in Supplementary

Computational drug repositioning
We performed virtual screening process using Autodock Vina 14 . The 3D structure of the FOXM1 and MYC proteins were downloaded from the RCSB-PDB 15 and further refinement was done using Open Babel software 16

Supplementary note 1 -Use of PEGASAS
In this investigation, we chose to use the previously proposed PEGASAS 1 approach to identify the splicing events whose PSI value co-varied with the expression of embryonic pathways (Fig. 1A/B). The rationale underlying PEGASAS is that the critical molecular entities underlying the phenotypic variations across disease, development, and ageing, co-vary with the biological processes and pathways linked with the specific phenotypic states. Therefore, the task of identifying the molecular changes associated with a phenotype is reduced to first identifying the processes and pathways linked to the phenotype and then in a subsequent step identifying molecular changes co-varying with those processes and pathways. This concept has been successfully applied to investigate alternative splicing in prostate, breast, and lung cancer datasets by Yi Xing's group 1 .
While the use of differential analysis, which relies on assessing the significance of difference between two biological conditions/states (for instance embryonic vs. adult) is appealing in its directness, the functional interpretation of the resulting gene list (exons in our case) is challenging and requires additional steps of gene set analysis, which can be unstable 2,3 and heavily depends on the choice of software 4 .
In contrast, in our approach, we first identify hundreds of KEGG pathways which show preferentially high activity during the pre-natal stage of the development (Fig 1B) and select the splicing events which exhibit a significant co-variation with those embryonic pathways. Such splicing events, by virtue of their preferential correlation with embryonic pathways, have a more straightforward interpretation which does not suffer from aforementioned pitfalls for their functional interpretation. To sum up, this approach enables us to identify the exons which are: a. preferentially embryonic in nature to begin with, and importantly, b. are correlated amongst each other, i.e., they change in coordinated fashion across developmental timepoints, which is consistent with previous publications showing the coordinated change of several hundred to thousands of exons in response to diverse biological signals and contexts [5][6][7] . Also, the multivariate structure of exon inclusions has been successfully used recently to discover the sQTLs across GTEx tissues 8 , further supporting our correlation based approach.
Below, we show that the PEGASAS based approach is superior in delivering these goals as compared to the conventional approach of differential splicing. For differential splicing, we used Wilcoxon test followed by FDR correction as has been performed previously 9 and called an event to be embryonic positive if DPSI (prenatal -postnatal PSI) values were > 0.2 with an FDR <= 0.05 in developing brain. We intersected this new set of embryonic events with our pathway-based EP events and obtained unique EP events in our approach (referred here to as pathway-only, pathway-based events that additionally qualified the differential splicing criteria above (referred to as pathway+wilcox), and the unique Wilcoxon based events (referred here to as wilcox-only).
First, we assessed the coordination in splicing within each of the three sets of exons by calculating the within-group pair-wise Pearson correlation coefficients among their inclusion level across timepoints. As shown in the boxplots below, we observed that pathway-based events (even the unique ones) are significantly more correlated with each other than the wilcox-only group. The coordinated nature of splicing events was an important aspect of our study which has been relatively neglected so far in the field of cancer biology and which helped us elucidate the important associations of the coordinated embryonic splicing events with cellular functions such as N-linked protein glycosylation and retrograde transport (Fig 3,   7 and S6).
Moreover, while the events identified by the pathway approach (as shown in Fig. 1C) and even the events unique to pathway-based approach (as shown below) were enriched for several GO terms related to the embryonic development of brain, notably, the events unique to differential events were not enriched for any functional category as shown below.
Finally, we found that the recapitulation of embryonic splicing in cancer holds up even in the pathway-only events but exhibits a much weaker trend for wilcox-only events that was statistically significant in only one of the three tissues.

Pathway-only
None of the term is functionally enriched

wilcox-only
Taken together, these analyses suggest that pathway-based approach is more effective in detecting the coordinated set of context-specific events that are better recapitulated in cancer as well as provides a more direct functional interpretation of splicing events correlated.
Supplementary note 2 -Assessment of embryonic reversal of cancer splicing using pseudo-alignment free approach detectionMethod oddsRatio brain using a pseudoalignment free approach using the STAR 2-pass strategy 12 to generate the genomic alignments for each of the transcriptomic datasets. We then used rMATs 13 to quantify the inclusion level of exon skip events in these three cohorts and assessed the embryonic reversal of cancer splicing. Off note, we observed a high across sample correlation between the inclusion level estimates derived from these two pipelines (median correlation coefficient 0.84).
Furthermore, as shown in the scatter plot below, we observed that the cancer associated changes in the inclusion level of exon skip events (i.e., TCGA -GTEx DPSI) was significantly positively correlated with development associated changes (i.e., Prenatal -Postnatal DPSI).
Further, using this new splicing data generated with rMATs, we defined a new set of embryonic splicing events (namely, EP for embryonic positive and EN for embryonic negative) by using Wilcoxon's rank sum test (absolute median inclusion difference > 0.2 and FDR < 0.1) between pre-natal and post-natal inclusion level of the exon skip events.
We compared this set of the embryonic splicing events, using Fisher's exact test, with the cancer specific events (namely 'Increased' for events with increased inclusion and 'decreased' for events with decreased inclusion in TCGA cancer samples as compared to GTEx normal samples) which were derived by using a similar approach. We observed a very strong and significant enrichment of the EP events among the events increased in cancer (and vice-versa for EN events), as shown in the bar-plots below.
These results indicate that conclusions generated using our Kallisto + SUPPA2 pipeline are statistically robust and can be re-capitulated using independent data processing methods.